MaAsLin 3 is the next generation of MaAsLin (Microbiome Multivariable Association with Linear Models).
MaAsLin3 is comprehensive R package for efficiently determining multivariable association between clinical metadata and microbial meta-omics features. MaAsLin3 relies on general linear models to accommodate most modern epidemiological study designs, including cross-sectional and longitudinal, along with a variety of filtering, normalization, and transform methods.
If you use the MaAsLin 3 software, please cite our manuscript:
William A. Nickols, Jacob T. Nearing, Kelsey N. Thompson, Jiaxian Shen, Curtis Huttenhower MaAsLin 3: Refining and extending generalized multivariate linear models for meta-omic association discovery. (In progress).
If you have questions, please direct it to: MaAsLin 3 Forum .
R is a programming language specializing in statistical computing and graphics. You can use R just the same as any other programming languages, but it is most useful for statistical analyses, with well-established packages for common tasks such as linear modeling, ’omic data analysis, machine learning, and visualization. We have created a short introduction tutorial here, but we note there are many other more comprehensive tutorials out there to learn R.
You can download and install the free R software environment here. Note that you should download the latest release - this will ensure the R version you have is compatible with MaAsLin 3.
RStudio is a freely available IDE (integrated development environment) for R. It is a “wrapper” around R with some additional functionalities that makes programming in R a bit easier. Feel free to download RStudio and explore its interface - but it is not required for this tutorial.
If you already have R installed, then it is possible that the R
version you have does not support MaAsLin 3. The easiest way to check
this is to launch R/RStudio, and in the console (“Console” pane in
RStudio), type in the following command (without the >
symbol):
sessionInfo()
The first line of output message should indicate your current R version. For example:
> sessionInfo()
R version 4.4.1 (2024-06-14)
For MaAsLin 3 to install, you will need R >= 4.3. If your version
is older than that, please refer to section Installing R for the first
time to download the latest R. Note that RStudio users will need to
have RStudio “switch” to this later version once it is installed. This
should happen automatically for Windows and Mac OS users when you
relaunch RStudio. For Linux users you might need to bind the correct R
executable. For more information refer to here.
Either way, once you have the correct version installed, launch the
software and use sessionInfo() to make sure that you indeed
have R >= 4.3.
The latest development version of MaAsLin 3 can be installed from
GitHub using the devtools package.
if (!require('devtools', character.only = TRUE)) {
install.packages('devtools')
}
library("devtools")
install_github("biobakery/MaAsLin3")
library("maaslin3")
MaAsLin3 requires two input files.
The data file can contain samples not included in the metadata file (along with the reverse case). For both cases, those samples not included in both files will be removed from the analysis. Also the samples do not need to be in the same order in the two files.
Example input files can be found in the inst/extdata
folder of the MaAsLin 3 source. The files provided were generated from
the HMP2 data which can be downloaded from https://ibdmdb.org/ .
# Read abundance table
taxa_table_name <- system.file("extdata", "HMP2_taxonomy.tsv", package = "maaslin3")
taxa_table <- read.csv(taxa_table_name, sep = '\t')
rownames(taxa_table) <- taxa_table$ID; taxa_table$ID <- NULL
taxa_table <- data.frame(t(taxa_table)) # Transpose to get samples x features (works either way)
# Read metadata table
metadata_name <- system.file("extdata", "HMP2_metadata.tsv", package = "maaslin3")
metadata <- read.csv(metadata_name, sep = '\t')
rownames(metadata) <- metadata$ID; metadata$ID <- NULL
# Factor the categorical variables to test against controls
metadata$diagnosis <-
factor(metadata$diagnosis, levels = c('nonIBD', 'UC', 'CD'))
metadata$dysbiosis_state <-
factor(metadata$dysbiosis_state, levels = c('none', 'dysbiosis_UC', 'dysbiosis_CD'))
metadata$antibiotics <-
factor(metadata$antibiotics, levels = c('No', 'Yes'))
taxa_table[1:5, 1:5]
## Phocaeicola_vulgatus Faecalibacterium_prausnitzii
## CSM5FZ3N_P 0.4265226 0.060255109
## CSM5FZ3R_P 0.5369584 0.007396904
## CSM5FZ3T_P 0.5911821 0.000000000
## CSM5FZ3V_P 0.2661378 0.029680329
## CSM5FZ3X_P 0.6601039 0.003596740
## Bacteroides_uniformis Prevotella_copri_clade_A Bacteroides_stercoris
## CSM5FZ3N_P 0.2692411314 0.000000e+00 0.000000000
## CSM5FZ3R_P 0.2526048469 0.000000e+00 0.008390958
## CSM5FZ3T_P 0.0000000000 0.000000e+00 0.000000000
## CSM5FZ3V_P 0.4004265260 0.000000e+00 0.000000000
## CSM5FZ3X_P 0.0008804283 1.308102e-05 0.001335669
metadata[1:5, 1:5]
## participant_id site_name week_num reads diagnosis
## CSM5FZ3N_P C3001 Cedars-Sinai 0 9961743 CD
## CSM5FZ3R_P C3001 Cedars-Sinai 2 16456391 CD
## CSM5FZ3T_P C3002 Cedars-Sinai 0 10511448 CD
## CSM5FZ3V_P C3001 Cedars-Sinai 6 17808965 CD
## CSM5FZ3X_P C3002 Cedars-Sinai 2 13160893 CD
HMP2_taxonomy.tsv: is a tab-delimited file with species
as rows and samples as columns. It is a subset of the taxonomy file that
includes just some of the the species abundances for most samples.
HMP2_metadata.tsv: is a tab-delimited file with samples
as rows and metadata as columns. It is a subset of the metadata file
that includes just some of the fields.
Several different types of statistical models can be used for association testing in MaAsLin (e.g. simple linear, zero-inflated, count-based, etc.). We have evaluated their pros and cons in the associated manuscript, and while the default is generally appropriate for most analyses, a user might want to select a different model under some circumstances. This is true for many different microbial community data types (taxonomy or functional profiles), environments (human or otherwise), and measurements (counts or relative proportions), as long as alternative models are used with appropriately modified normalization/transformation schemes.
NEGBIN and ZINB, whereas, for non-count
(e.g. percentage, CPM, or relative abundance) input, you can use
LM and CPLM.TMM and CSS only work on counts, and they also
return normalized counts, unlike TSS and CLR.
Therefore, if your inputs are counts, you can use the above two
normalizations (i.e., TMM, CSS, or
NONE (in case the data is already normalized)) without a
further transformation (i.e. transform = NONE).CPLM requires the data to
be positive. Therefore, any transformation that produces negative values
will typically NOT work for CPLM.transform = NONE.LM is the only model that works on
both positive and negative values (following
normalization/transformation), and (as per our manuscript), it is
generally much more robust to parameter changes (which are typically
limited for non-LM models). Regarding whether to use LM,
CPLM, or other models, intuitively, CPLM or a
zero-inflated alternative should perform better in the presence of
zeroes, but based on our benchmarking, we do not have evidence that
CPLM is significantly better than LM in
practice.| Model (-m ANALYSIS_METHOD) | Data type | Normalization (-n NORMALIZATION) | Transformation (-t TRANSFORM) |
|---|---|---|---|
| LM | count and non-count | TSS,CLR, NONE |
LOG, LOGIT,
AST,NONE |
| CPLM | count and non-count | TSS, TMM, CSS,
NONE |
NONE |
| NEGBIN | count | TMM, CSS, NONE |
NONE |
| ZINB | count | TMM, CSS, NONE |
NONE |
NOTE: generally you will also need to consider what thresholds you have used to filter the data (i.e. the prevalence and abundance threshold) to reduce outliers and false positives. See section 4.3.1 Prevalence and abundance filtering in MaAsLin 2 for some more details on this.
The following command runs MaAsLin 3 on the HMP2 data, running a
multivariable regression model to test for the association between
microbial species abundance versus IBD diagnosis and dysbiosis
scores (fixed_effects = c("diagnosis", "dysbiosis")).
For any categorical variable with more than 2 levels, you will also have
to specify which variable should be the reference level by using
(reference = c("diagnosis,nonIBD")). NOTE: adding a space
between the variable and level might result in the wrong reference level
being used. Output are generated in a folder called
demo_output under the current working directory
(output = "demo_output"). In this case, the example input
data has been pre-normalized and pre-filtered, so we turn off the
default normalization and prevalence filtering as well.
fit_data = Maaslin2(input_data = input_data,
input_metadata = input_metadata,
min_prevalence = 0,
normalization = "NONE",
output = "demo_output",
fixed_effects = c("diagnosis", "dysbiosis"),
reference = c("diagnosis,nonIBD"))
If you are familiar with the data frame class in R, you can also
provide data frames for input_data and
input_metadata for MaAsLin 3, instead of file names. One
potential benefit of this approach is that it allows you to easily
manipulate these input data within the R environment.
fit_data2 = Maaslin2(input_data = df_input_data,
input_metadata = df_input_metadata,
min_prevalence = 0,
normalization = "NONE",
output = "demo_output2",
fixed_effects = c("diagnosis", "dysbiosis"),
reference = c("diagnosis,nonIBD"))
?dplyr::filter()Perhaps the most important output from MaAsLin 3 is the list of
significant associations. These are provided in
significant_results.tsv:
These are the full list of associations that pass MaAsLin 3’s significance threshold, ordered by increasing q-values. Columns are:
metadata: the variable name being associated with a
microbial feature.
feature: the microbial feature (taxon, gene,
pathway, etc.).
value: for categorical features, the specific
feature level for which the coefficient and significance of association
is being reported.
coef: the model coefficient value (effect size).
value versus the reference
category.stderr: the standard error from the model.
N: the total number of samples used in the model for
this association (since e.g. missing values can be excluded).
N.not.0: the total of number of these samples in
which the feature is non-zero.
pval: the nominal significance of this
association.
qval: the corrected significance is computed with
p.adjust with the correction method (BH, etc.).
Question: how would you interpret the first row of this table?
For each of the significant associations in
significant_results.tsv, MaAsLin 3 also generates
visualizations for inspection (boxplots for categorical variables,
scatter plots for continuous). These are named as “metadata_name.pdf”.
For example, from our analysis run, we have the visualization files
dysbiosis.pdf and diagnosis.pdf:
MaAsLin 3 generates two types of output files: data and visualization.
significant_results.tsv
all_results.tsv
significant_results.tsv, but include all
association results (instead of just the significant ones).fit_data$results.residuals.rds
maaslin2.log
heatmap.pdf
[a-z/0-9]+.pdf
If your functional tables have been produced by HUMAnN and been
normalized with the humann_renorm_table script, then you
don’t need to perform any additional normalization – the effects of
sequencing depth will have already been removed (which is what we will
be using in this tutorial). We tend to work with CPM units because we
find them to be more convenient, but they are numerically equivalent to
relative abundances for modeling purposes (CPM = RA * 1e6).
Otherwise - you will want to use the same thoughts above about model/normalization/transformation as above.
Some other quick tips for using MaAsLin 3 on functional profiles: * To reduce the number of features, you typically want to run either the unstratified functional features (or a filtered subset of the stratified features) * You may also want to remove functional features that are highly correlated to one specific taxon (i.e. likely contributed by that microbe), since these can be better-explained by taxonomic changes.
#This can also be done with with the HUMAnN 3 untiliy `humann_split_stratified_table`
unstrat_pathways <-function(dat_path){
temp = dat_path[!grepl("\\|",rownames(dat_path)),]
return(temp)
}
df_input_path = unstrat_pathways(df_input_path)
Run the same model as above on the MetaCyc pathway table generated from the bioBakery workflows.
fit_func = Maaslin2(input_data = df_input_path,
input_metadata = df_input_metadata,
output = "demo_functional",
fixed_effects = c("diagnosis", "dysbiosis"),
reference = c("diagnosis,nonIBD"),
min_abundance = 0.0001,
min_prevalence = 0.1)
Note: Here we have the samples in the columns and the features in the rows and MaAsLin 3 correctly identified this and was able to match the samples.
We have also recently published on using MaAsLin 3 to analyze differential features in metatranscriptomic data (MTX), for further information on that please see our recent publication.
Some helpful tips and tricks (HUMAnN related) * HUMAnN regroup function * HUMAnN rename function * HUMAnN renorm function * HUMAnN split stratified table
Q: Does the output of the functional run make sense? What are some of the important features to investigate further?
Many statistical analysis are interested in testing for interaction
between certain variables. Unfortunately, MaAsLin 3 does not yet provide
a direct interface for this. Instead, the user needs to create
artificial interaction columns as additional fixed_effects
terms. Using the above fit as an example, to test for the interaction
between diagnosis and dysbiosis, I can create
two additional columns: CD_dysbiosis and
UC_dysbiosis (since the reference for
diagnosis is nonIBD):
df_input_metadata$CD_dysbiosis = (df_input_metadata$diagnosis == "CD") *
df_input_metadata$dysbiosis
df_input_metadata$UC_dysbiosis = (df_input_metadata$diagnosis == "UC") *
df_input_metadata$dysbiosis
fit_data_ixn = Maaslin2(input_data = df_input_data,
input_metadata = df_input_metadata,
min_prevalence = 0,
normalization = "NONE",
output = "demo_output_interaction",
fixed_effects = c("diagnosis", "dysbiosis", "CD_dysbiosis", "UC_dysbiosis"),
reference = c("diagnosis,nonIBD"))
Certain studies have a natural “grouping” of sample observations,
such as by subject in longitudinal designs or by family in family
designs. It is important for statistic analysis to address the
non-independence between samples belonging to the same group MaAsLin 3
provides a simple interface for this through the parameter
random_effects, where the user can specify the grouping
variable to run a mixed
effect model instead. For example, we note that HMP2 is a
longitudinal design where the same subject (column subject)
can have multiple samples. We thus ask MaAsLin 3 to use subject as its
random effect grouping variable:
fit_data_random = Maaslin2(input_data = df_input_data,
input_metadata = df_input_metadata,
min_prevalence = 0,
normalization = "NONE",
output = "demo_output_random",
fixed_effects = c("diagnosis", "dysbiosis"),
random_effects = c("subject"),
reference = c("diagnosis,nonIBD"))
If you are interested in testing the effect of time in a longitudinal
study, then the time point variable should be included in
fixed_effects during your MaAsLin 3 call.
MaAsLin 3 provide many parameter options for different data pre-processing (normalization, filtering, transfomation) and other tasks. The full list of these options are:
min_abundance
0
]min_prevalence
0.1 ]max_significance
0.25
]normalization
"TSS" ] [
Choices: "TSS", "CLR", "CSS",
"NONE", "TMM" ]transform
"LOG" ] [ Choices:
"LOG", "LOGIT", "AST",
"NONE" ]analysis_method
"LM" ] [
Choices: "LM", "CPLM", "ZICP",
"NEGBIN", "ZINB" ]correction
"BH" ]standardize
TRUE ]plot_heatmap
TRUE ]heatmap_first_n
50 ]plot_scatter
TRUE ]cores
1 ]reference
Typically, it only makes sense to test for feature-metadata
associations if a feature is non-zero “enough” of the time. “Enough” can
vary between studies, but a 10-50% minimum prevalence threshold is not
unusual (and up to 70-90% can be reasonable). Since we’re already using
a small, selected subset of demo data, we’ll use
min_prevalence = 0.1 to test only features with at least
10% non-zero values. Note: this is the default
parameter setting in MaAsLin 3.
Similarly, it’s often desirable to test only features that reach a
minimum abundance threshold in at least this many samples. By default,
MaAsLin 3 will consider any non-zero value to be reliable, and if you’ve
already done sufficient QC in your dataset, this is appropriate.
However, if you’d like to filter more or be conservative, you can set a
minimum abundance threshold like min_abundance = 0.0001 to
test only features reaching at least this (relative) abundance.
Putting these arguments together, we can run MaAsLin 3 (and it’ll go even faster!) using both:
fit_data_filter = Maaslin2(input_data = df_input_data,
input_metadata = df_input_metadata,
normalization = "NONE",
output = "demo_output_filter",
fixed_effects = c("diagnosis", "dysbiosis"),
reference = c("diagnosis,nonIBD"),
random_effects = c("subject"),
min_prevalence = 0.1,
min_abundance = 0.0001)
MaAsLin 3 can also be run with a command line interface. For example, the demo analysis can be performed with:
Maaslin2.R --transform=AST --fixed_effects="diagnosis,dysbiosis" /usr/local/lib/R/site-library/Maaslin2/extdata/HMP2_taxonomy.tsv /usr/local/lib/R/site-library/Maaslin2/extdata/HMP2_metadata.tsv demo_output --reference="diagnosis,nonIBD"
HMP2_taxonomy.tsv is the path to your data (or
features) fileHMP2_metadata.tsv is the path to your metadata
filedemo_output is the path to the folder to write the
outputFull help documentation:
$ Maaslin2.R --help
Usage: ./R/Maaslin2.R [options] <data.tsv> <metadata.tsv> <output_folder>
Options:
-h, --help
Show this help message and exit
-a MIN_ABUNDANCE, --min_abundance=MIN_ABUNDANCE
The minimum abundance for each feature [ Default: 0 ]
-p MIN_PREVALENCE, --min_prevalence=MIN_PREVALENCE
The minimum percent of samples for which a feature
is detected at minimum abundance [ Default: 0.1 ]
-s MAX_SIGNIFICANCE, --max_significance=MAX_SIGNIFICANCE
The q-value threshold for significance [ Default: 0.25 ]
-n NORMALIZATION, --normalization=NORMALIZATION
The normalization method to apply [ Default: TSS ]
[ Choices: TSS, CLR, CSS, NONE, TMM ]
-t TRANSFORM, --transform=TRANSFORM
The transform to apply [ Default: LOG ]
[ Choices: LOG, LOGIT, AST, NONE ]
-m ANALYSIS_METHOD, --analysis_method=ANALYSIS_METHOD
The analysis method to apply [ Default: LM ]
[ Choices: LM, CPLM, NEGBIN, ZINB ]
-r RANDOM_EFFECTS, --random_effects=RANDOM_EFFECTS
The random effects for the model, comma-delimited
for multiple effects [ Default: none ]
-f FIXED_EFFECTS, --fixed_effects=FIXED_EFFECTS
The fixed effects for the model, comma-delimited
for multiple effects [ Default: all ]
-c CORRECTION, --correction=CORRECTION
The correction method for computing the
q-value [ Default: BH ]
-z STANDARDIZE, --standardize=STANDARDIZE
Apply z-score so continuous metadata are
on the same scale [ Default: TRUE ]
-l PLOT_HEATMAP, --plot_heatmap=PLOT_HEATMAP
Generate a heatmap for the significant
associations [ Default: TRUE ]
-i HEATMAP_FIRST_N, --heatmap_first_n=HEATMAP_FIRST_N
In heatmap, plot top N features with significant
associations [ Default: TRUE ]
-o PLOT_SCATTER, --plot_scatter=PLOT_SCATTER
Generate scatter plots for the significant
associations [ Default: TRUE ]
-e CORES, --cores=CORES
The number of R processes to run in parallel
[ Default: 1 ]
-r REFERENCE, --reference='VARIABLE,REFERENCE'
The factor to use as a reference for a
variable with more than two levels
provided as a string of 'variable,reference'
semi-colon delimited for multiple variables.